Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation

Neural Information Processing Systems

Offline Reinforcement Learning (RL) is a promising approach for learning optimal policies in environments where direct exploration is expensive or infeasible. However, the adoption of such policies in practice is often challenging, as they are hard to interpret within the application context, and lack measures of uncertainty for the learned policy value and its decisions. To overcome these issues, we propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning. In particular, we have three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk-averse implementations tailored to the application context, and finally, 3) we propose a way to interpret ESRL's policy at every state through posterior distributions, and use this framework to compute off-policy value function posteriors. We provide theoretical guarantees for our estimators and regret bounds consistent with Posterior Sampling for RL (PSRL). Sample efficiency of ESRL is independent of the chosen risk aversion threshold and quality of the behavior policy.


Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation

Neural Information Processing Systems

With increasing success in reinforcement learning (RL), there is broad interest in applying these methods to real-world settings. This has brought exciting progress in offline RL and off-policy policy evaluation (OPPE).


Review for NeurIPS paper: Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation

Neural Information Processing Systems

Weaknesses: The empirical evaluation omits relevant baselines, which makes it hard to judge the usefulness of ESRL relative to prior approaches. The main algorithm (Algo 1) combines majority voting and hypothesis testing with learning multiple Q-estimates based on K sampled MDPs. Furthermore, based on the figure captions, K seems to be large (250 for Riverswim, 500 for Sepsis), and it seems unfair to compare against a single DQN model. A *naive* baseline would be to take the ensemble of these K Q-estimates and simply use their mean for selecting actions: this would *quantify* the empirical benefit from hypothesis testing. This baseline should be discussed in the paper and compared against empirically, as it is a simple way to incorporate value uncertainty in offline RL. 3. As mentioned in the paper, ESRL can deviate from the behavior policy when required or stick to it, depending on the outcome of the hypothesis test.
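The difference between the naive mean-ensemble baseline and a majority vote over the K Q-estimates can be illustrated with a minimal sketch. It is not the paper's Algorithm 1: the hypothesis-testing step is left out, and the function names and the toy Q-values below are hypothetical, chosen only to show a case where the two rules disagree.

```python
from collections import Counter


def greedy(q_row):
    """Index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def mean_ensemble_action(q_samples):
    """Naive baseline: average the K Q-estimates, then act greedily."""
    k = len(q_samples)
    n_actions = len(q_samples[0])
    mean_q = [sum(q[a] for q in q_samples) / k for a in range(n_actions)]
    return greedy(mean_q)


def majority_vote_action(q_samples):
    """Each of the K estimates votes for its own greedy action."""
    votes = Counter(greedy(q) for q in q_samples)
    return votes.most_common(1)[0][0]


# Hypothetical Q-estimates for a single state: K = 3 samples, 2 actions.
# One sample strongly prefers action 0; the other two mildly prefer action 1.
q_samples = [[1.0, 1.1], [1.0, 1.2], [5.0, 0.0]]

print(mean_ensemble_action(q_samples))  # 0: the outlier dominates the mean
print(majority_vote_action(q_samples))  # 1: two of three samples agree
```

In this toy case a single outlying estimate flips the mean-based choice but not the vote, which is the kind of gap an empirical comparison between the two rules would measure.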


Review for NeurIPS paper: Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation

Neural Information Processing Systems

This paper proposes an interesting way to use hypothesis testing to incorporate expert knowledge into offline RL. The proposed approach is exciting and good enough to be published at NeurIPS. The experimental results are interesting as well. However, the authors should address the concerns about the presentation and theoretical results raised by Reviewer 1 in the camera-ready version of the paper. At the very least, they should discuss these concerns as limitations of the approach in the paper's conclusion.


Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation

arXiv.org Artificial Intelligence

Offline Reinforcement Learning (RL) is a promising approach for learning optimal policies in environments where direct exploration is expensive or infeasible. However, the adoption of such policies in practice is often challenging, as they are hard to interpret within the application context, and lack measures of uncertainty for the learned policy value and its decisions. To overcome these issues, we propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning. In particular, we have three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk aversion within the application context, and finally, 3) we propose a way to interpret ESRL's policy at every state through posterior distributions, and use this framework to compute off-policy value function posteriors. We provide theoretical guarantees for our estimators and regret bounds consistent with Posterior Sampling for RL (PSRL) that account for any risk aversion threshold. We further propose an offline version of PSRL as a special case of ESRL.
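As a rough illustration of what posterior sampling from a fixed offline dataset can look like, the sketch below draws K transition models from a Dirichlet posterior over observed transition counts, solves each sampled MDP, and reports how often each action is greedy in a given state. Everything specific here is an assumption made for the example and not taken from the paper: the Dirichlet prior, the known mean rewards, the discounted infinite-horizon setting, and the toy counts.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_transitions(counts, prior, rng):
    """Sample P[s][a][s'] from the Dirichlet posterior given transition counts."""
    return [[sample_dirichlet([c + prior for c in row], rng) for row in per_state]
            for per_state in counts]


def q_values(P, R, gamma=0.9, iters=200):
    """Q-value iteration on one sampled MDP with known mean rewards R[s][a]."""
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_a)] for s in range(n_s)]
    return Q


# Hypothetical offline data: counts[s][a][s'] for 2 states and 2 actions.
counts = [[[8, 2], [1, 9]],
          [[5, 5], [2, 8]]]
R = [[0.0, 0.0],
     [0.0, 1.0]]  # assumed known: only (state 1, action 1) is rewarded

K = 50
rng = random.Random(0)
wins = [0, 0]
for _ in range(K):
    Q = q_values(sample_transitions(counts, 1.0, rng), R)
    best = max(range(2), key=lambda a: Q[0][a])
    wins[best] += 1

# Fraction of posterior samples in which each action is greedy at state 0.
greedy_freq = [w / K for w in wins]
print(greedy_freq)
```

The resulting frequencies give a simple empirical estimate of the posterior probability that each action is optimal in state 0, which is one way to read a per-state policy through a posterior distribution.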